Being Bayesian in the 2020s: opportunities and challenges in the practice of modern applied Bayesian statistics

Building on a strong foundation of philosophy, theory, methods and computation over the past three decades, Bayesian approaches are now an integral part of the toolkit for most statisticians and data scientists. Whether they are dedicated Bayesians or opportunistic users, applied professionals can now reap many of the benefits afforded by the Bayesian paradigm. In this paper, we touch on six modern opportunities and challenges in applied Bayesian statistics: intelligent data collection, new data sources, federated analysis, inference for implicit models, model transfer and purposeful software products. This article is part of the theme issue ‘Bayesian inference: challenges, perspectives, and prospects’.

Later in this paper, we showcase purposeful software products built on Bayesian models: CoRiCAL, a COVID-19 vaccine risk-benefit calculator based on a Bayesian network; ReefCloud, a tool to monitor coral reefs worldwide based on a Bayesian hierarchical model; and the Australian Cancer Atlas, an interactive visualization of small-area cancer incidence and survival across Australia based on a Bayesian spatial model.

Intelligent data collection

(a) Overview
The ability to determine and characterize underlying mechanisms in complex systems is paramount to pioneering research and scientific advancement in the modern era. Over the last decade, the rise of data generation from sensor- and internet-enabled devices has catalysed the advancement of data collection technologies and of the analysis methods used to extract meaningful information from complex systems. However, the sheer size of these complex systems (e.g. natural ecosystems like the Great Barrier Reef and river networks) and the expense of data collection mean that data cannot be collected throughout the whole system. Further, practical constraints like connectivity, accessibility and data storage issues reduce our ability to sample frequently through time. This has led to innovation in statistical methods for data collection, promoting an emerging era of 'intelligent data collection' where data are collected for a particular purpose, such as understanding mechanisms for change, monitoring biodiversity and identifying threats or vulnerabilities to long-term sustainability. Bayesian optimal experimental design is one such area of recent innovation.
Bayesian design offers a framework for optimizing the collection of data, specified by a design d, for a particular experimental goal, which may be to increase the precision of parameter estimates, maximize prediction accuracy and/or distinguish between competing models. More specifically, Bayesian design is concerned with maximizing an expected utility, $U(d) = \mathbb{E}[u(d, \theta, y)]$, through the choice of design d within a design space D, while accounting for uncertainty about, for example, the parameter θ ∈ Θ and all conceivable datasets y ∈ Y that we might observe. A Bayesian optimal design d* can therefore be expressed as
$$ d^* = \arg\max_{d \in D} U(d) = \arg\max_{d \in D} \int_Y \int_\Theta u(d, \theta, y)\, p(\theta, y; d)\, \mathrm{d}\theta\, \mathrm{d}y, $$
where p(θ, y; d) denotes the joint distribution of θ and y given a design d.
Unfortunately, determining d* can be challenging. Firstly, the utility function u(d, θ, y) typically involves computing some form of expected value with respect to the posterior distribution, which itself is typically analytically intractable. Further, U(d) is itself an expectation taken with respect to the prior-predictive distribution, which is also typically intractable. This means numerical or approximate methods are needed, which may impose substantial compute time and/or require a stochastic approximation. For example, Monte Carlo integration has been proposed as an approach to approximate the expected utility as follows:
$$ U(d) \approx \frac{1}{M} \sum_{m=1}^{M} u(d, \theta^{(m)}, y^{(m)}), \qquad (2.1) $$
where θ^(m) ∼ p(θ; d) and y^(m) ∼ p(y | θ^(m); d), for some large value of M. Thus, computations involving M different individual posterior distributions are required just to approximate the expected utility of a single design. Secondly, d may be high-dimensional, meaning that a potentially large optimization problem needs to be solved for a computationally expensive and noisy objective function. Accordingly, the majority of research in Bayesian design has focused on developing new methods to address one of these two challenges. Below we provide a brief summary of some of the relevant literature from the last 25 years. Since the conception of statistical decision theory [1,2], upon which the decision-theoretic framework of Bayesian design is based [3], numerous strategies have been presented in the literature to address the above challenges. Curve-fitting methods were proposed by Müller & Parmigiani [4] to approximate the expected utility in equation (2.1). Here, the fitted curve is optimized as a surrogate for the true expected utility to determine the choice of (approximately optimal) design. An alternative simulation-based method proposed by Müller [5] formed the following augmented joint distribution on d, θ and y:
$$ h_J(d, \theta_{1:J}, y_{1:J}) \propto \prod_{j=1}^{J} u(d, \theta_j, y_j)\, p(y_j, \theta_j; d), $$
where it can be shown that the marginal distribution of d is proportional to U(d). Markov chain Monte Carlo (MCMC) methods were then used to sample from this distribution, and subsequently to approximate the marginal mode of d. Extensions of this approach were given in [6,7], which include adopting a sequential Monte Carlo (SMC) algorithm to more efficiently sample from the augmented distribution as J increases. However, such approaches are limited to low-dimensional design problems (i.e. 3-4 design points) and simple models due to difficulties in sampling efficiently in high dimensions.
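To make the computational burden of equation (2.1) concrete, the following sketch shows how the approximation might be implemented. The sample_prior, simulate and utility callables are hypothetical placeholders for a particular design problem; note that the utility itself typically requires an (approximate) posterior for each simulated dataset, which is where the M-fold cost arises.

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_utility(design, utility, sample_prior, simulate, M=1000):
    """Monte Carlo approximation of U(d), as in equation (2.1).

    sample_prior(design, rng) -> draw theta from p(theta; d)
    simulate(theta, design, rng) -> draw y from p(y | theta; d)
    utility(design, theta, y) -> u(d, theta, y); in practice this usually
        requires approximating the posterior p(theta | y; d), hence the
        M posterior computations mentioned in the text.
    """
    total = 0.0
    for _ in range(M):
        theta = sample_prior(design, rng)
        y = simulate(theta, design, rng)
        total += utility(design, theta, y)
    return total / M
```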
Recently, there has been a shift from sampling-based methods to rapid, approximate posterior inference methods. Combined with a Monte Carlo approximation as given in equation (2.1), this has enabled expected utility functions to be efficiently approximated for realistic design problems. This includes those based on complex models (such as nonlinear models) and models for data that exhibit complex dependence structures (such as those with different sources of variability, including spatial and between-group variability). Such approximate inference methods include the Laplace approximation [8] and variational Bayes [9], which have been combined with new optimization algorithms (e.g. the approximate coordinate exchange algorithm, ACE [10]) to solve the most complex and high-dimensional design problems to date.
The most prominent application of Bayesian design methods appears in the clinical trial literature [11]. Recently, this interest has been intensified by the outbreak of COVID-19, where it has been desirable to conduct clinical trial assessments as quickly as possible, with Bayesian (adaptive) designs shown to yield more resource-efficient and ethical clinical trials [12,13]. More recently, Bayesian design methods have been proposed as a basis to efficiently monitor large environmental systems like the Great Barrier Reef [14,15]. In the following case study, we show how such methods can be used to form sampling designs to monitor a coral reef system, and extend these methods to provide flexible designs that address major practical constraints when sampling real-world ecosystems.
(b) Case study: sampling windows for coral reef monitoring

Coral reefs are biodiversity hot spots for marine species under threat from anthropogenic impacts related to climate change, water pollution and over-exploitation, among other factors [16]. Coral cover is a commonly used indicator to infer the health of coral reef environments [17], where data collection relies on a series of images taken underwater along a transect (a line across a habitat). Monitoring of coral reef environments is expensive in terms of monetary, human and technological costs, particularly for remote locations. Informative data are critical to support conservation decisions, but with limited resources to invest in monitoring programs, there is a need to optimize in-field activities so that data are collected intelligently. Following [18], we consider monitoring submerged shoals, which are coral reefs that exist at depths of around 18-40 m below sea level. Data collection at such depths requires unmanned vehicles to be deployed along a design, i.e. a series of transects which specify where images should be collected. However, spatially precise sampling is known to be difficult in deeper reefs due to unpredictable weather and water currents. Therefore, our aim is to provide Bayesian designs that offer flexibility in where transects will be placed, while taking into consideration the complex nature of the systems we are monitoring, such as the spatial dependence of natural processes.
In order to define a design, we specify the placement of each transect k = 1, …, q on the shoal by its midpoint, given in Easting and Northing coordinates E_k and N_k, and the angle of the transect, α_k, in degrees. Each transect line is expressed as a design point d_k = (E_k, N_k, α_k). The exact sampling locations (equally spaced along the fixed-length transect) are denoted s_i. For each transect, we introduce a radius parameter r_k > 0 that allows the sampled image locations to disperse by δ = (δ_1, δ_2), with δ_1, δ_2 ∼ iid Unif(−r_k, r_k), i.e. sampling at s_i + δ. For image i, a number n_i of randomly selected points on the image are classified as either hard coral or not. Accordingly, the number of points within an image that contain hard coral, y_i, is modelled as
$$ y_i \sim \text{Binomial}(n_i, p_i), \qquad \text{logit}(p_i) = x_i^\top \beta + Z(s_i), $$
and Z ∼ N(0, Σ(γ)), for regression parameters β = (β_0, …, β_n)^⊤, covariance kernel parameters γ = (γ_1, …, γ_m)^⊤ for the spatially correlated random effect Z, and covariates x_i. The priors for β and γ are based on consideration of historical data (depth and depth squared) collected on the shoal. See [18] for further details.
As a basis for improved monitoring, we consider the amount learned from the data regarding the parameters of the above model as our goal of data collection. For this, we specify our utility function as the Kullback-Leibler divergence of the posterior from the prior distribution, where larger values indicate that the data are more informative with respect to the model parameters.
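In our notation, a standard form for this utility (consistent with the description above, though the case study's exact formulation may differ slightly) is
$$ u(d, y) = \int_{\Theta} p(\theta \mid y; d) \log \frac{p(\theta \mid y; d)}{p(\theta)} \, \mathrm{d}\theta, $$
which is then averaged over the prior-predictive distribution as in equation (2.1).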
For a computationally efficient approximation of the utility for a given design, we employ a Laplace approximation of the posterior distribution, i.e. an approximation of the form
$$ p(\theta \mid y; d) \approx \mathcal{N}\big(\theta^*, H(\theta^*)^{-1}\big), $$
where θ* = arg max_{θ∈Θ} log p(y, θ; d) and H(θ*) is the Hessian matrix evaluated at θ*. Here θ = (β, γ), and marginalization of Z is performed approximately using Monte Carlo integration. To obtain an optimal design for monitoring the shoal, we propose a two-step approach: (i) firstly, a global search for the Bayesian optimal design d* = (d*_1, …, d*_q), where q = 3 (the total number of transects), is conducted. We consider a discretized design space, and find designs via a discrete version of ACE; and (ii) secondly, we form design efficiency windows (illustrating robustness to imprecise sampling) across r_k for each transect k = 1, …, q. To do so, we specify a zero-mean Gaussian process (GP) prior for the approximate expected utility across r ∈ R^q, denoted U(r; d*), i.e. U(r; d*) ∼ GP(0, K(·) + ζ_0 I), for some kernel matrix K(·) and ζ_0 > 0. The windows are then obtained as follows:
(a) centre the radius on d*, i.e. the Bayesian design from (i), and specify a maximum value for r_k for k = 1, …, q;
(b) draw perturbations δ_i ∼ Unif(−r_k, r_k), where k is the transect from which image i is obtained, and evaluate the approximate expected utility of the design at locations s_i + δ_i;
(c) fit a GP defined on r ∈ R^q to the approximate expected utilities;
(d) emulate the expected utility surface across values of r using the posterior predictive mean of the GP, denoted Ū(r);
(e) normalize the predicted expected utility values by that of the original Bayesian design as follows:
$$ \text{eff}(r) = \frac{\bar{U}(r)}{\bar{U}(\mathbf{0})}, \qquad (2.2) $$
where Ū(0) corresponds to the original (unperturbed) Bayesian design, and use the above to obtain design efficiency contours (plotted in figure 1a). For some design efficiency contour value c > 0, the corresponding sampling window is the region in space defined by radii r(c) that satisfy eff(r(c)) = c.

Figure 1. The optimal transects, d*, are illustrated as black transect lines (b). Sampling windows are formed around these transects, allowing for flexibility in sampling locations while retaining 0.99 of the optimal utility. Design efficiency contours across r ∈ R^q are shown (a).
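As a rough illustration of steps (c)-(e), the sketch below emulates a utility surface with an off-the-shelf GP and thresholds the resulting efficiency at 0.99. The design grid, the placeholder utilities and the kernel settings are hypothetical stand-ins rather than the case study's actual quantities.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(1)

# Hypothetical stand-ins: r_grid holds sampled radius vectors r in R^q, and
# u_hat holds the corresponding noisy approximate expected utilities (each
# obtained in practice by perturbing image locations and re-evaluating U).
q = 3
r_grid = rng.uniform(0.0, 50.0, size=(200, q))
u_hat = 1.0 - r_grid.sum(axis=1) / 300.0 + rng.normal(0, 0.005, size=200)

# Steps (c)-(d): fit a GP (RBF kernel plus a noise term, mirroring
# K(.) + zeta_0 * I) and use its predictive mean as the emulator U-bar(r).
gp = GaussianProcessRegressor(kernel=RBF(length_scale=10.0) + WhiteKernel())
gp.fit(r_grid, u_hat)

# Step (e): design efficiency relative to the unperturbed design (r = 0).
r_new = rng.uniform(0.0, 50.0, size=(1000, q))
u_bar = gp.predict(r_new)
u_opt = gp.predict(np.zeros((1, q)))[0]
efficiency = u_bar / u_opt
windows = r_new[efficiency >= 0.99]  # radii retaining 99% of optimal utility
```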
Based on this approach, the Bayesian design, d*, shown in figure 1, situates the transects in shallower areas of the reef, but at different depths within those areas, presumably to provide information about the depth effects, β. Avoiding the deeper regions of the shoal makes sense physiologically, as the corals monitored here are photosynthetic organisms, and therefore rely on light to survive. This design thus avoids the collection of data in areas where there is little chance of observing coral. The design efficiency contours are also shown in figure 1. If we consider a design efficiency of 0.99 in equation (2.2), then possible radius values are (50, 0, 44) for the three transects, with the flexibility (sampling windows) this provides shown around each transect. As can be seen, transect d*_2 is more sensitive than the other two, suggesting more effort should be placed in sampling this transect precisely. In practical terms, sampling from shallow areas of the reef, d*_1 and d*_3, can be undertaken when conditions are more unpredictable (e.g. strong currents), and samples from d*_2 can be obtained when field conditions are more favourable.
In conclusion, Bayesian optimal design addresses a fundamental problem in science: the intelligent collection of data resulting in greater information efficiency, reduced sampling cost and improved estimation. Such benefits have been observed in clinical trials [19-22] and environmental monitoring [15,23], and we have shown how they can be used to offer flexible yet efficient sampling in a real-world context. One limitation of the approach is the potential reliance of designs on a number of assumptions, e.g. an assumed model for the data, so we would encourage future research in areas that reduce this reliance and thus provide more robust designs for data collection.

New data sources
As part of the digital revolution, data from new types of technologies (e.g. virtual reality (VR) technology, satellite imagery and in situ sensor data) are becoming available, providing opportunities to gain insights into challenging applied research areas such as environmental conservation. In this section, we describe new sources of data arising from subject elicitation using VR and citizen science (CS), as well as illustrating how Bayesian modelling can be applied to such data for the purposes of informing management decisions in Antarctica and the Australian Great Barrier Reef.

(a) Elicitation using virtual reality
Recent advancements in digital technologies have led to the large-scale collection of more advanced data such as sensor data, satellite imagery, and a host of varied-resolution imagery and video, including those taken using 360° cameras. These advances in technology make it possible to deliver location-specific data to remote researchers for analysis. The emergence of VR technology, for example, acts as a way to connect the public and the scientific community, creating innovative pathways for environmental conservation research by immersing subjects in an otherwise inaccessible, vivid virtual scene for elicitation purposes [24,25]. The opinions and knowledge extracted from this process are themselves new data, which can be used for educational purposes [26] or incorporated into statistical models.
Increases in the volume of more complex types of data have led to the development of more effective and efficient analysis methodology. Recently, Bayesian models have seen use as a method to evaluate subject elicitation in the areas of coral reef conservation [27], jaguar and koala habitat suitability assessments [28,29], and the aesthetic value of sites in the Antarctic Peninsula.
(i) Case study: quantifying aesthetics of tourist landing sites in the Antarctic Peninsula

In the Antarctic Peninsula, the effects of climate change and the associated increase of ice-free areas are threatening the fragile terrestrial biodiversity [30]. As well as high ecological importance, these ecosystems also have a unique aesthetic value which has been formally recognized in Article 3 of the Protocol on Environmental Protection to the Antarctic Treaty [31]. There is value in protecting beautiful landscapes, as tourism in Antarctica is based largely on the natural beauty of the environment. This case study quantifies aesthetic values in the Antarctic Peninsula by recording elicitation from subjects immersed in a VR environment using a state-of-the-art web-based framework, R2VR [32].
Subject elicitation in this case study is drawn from 16 photos, obtained via 360° photography at tourist landing sites in the Antarctic Peninsula. Consultation produced landscape characteristics of interest, e.g. the presence of certain animals and the weather. These characteristics and images were then used to construct an interview, to be held while the subject was immersed in the VR environment, with responses recorded on the Likert scale, from strongly disagree to strongly agree. From this elicitation process, responses to each question are recorded for each scene presented to the participant, as well as their opinion of the aesthetic value of the scene itself. Additionally, general participant characteristics such as gender identity and age are also recorded. A Bayesian hierarchical model is used for modelling the response of whether or not a subject i determines scene j (j = 1, …, o) to be aesthetically pleasing (y_ij) as a function of responses to statements such as 'there are animals in this image' and 'this image is monotonous' (x_ik, k = 1, …, m), subject characteristics such as age and gender (x_ih, h = m, …, m + n), and subject-reported confidence in their response to each interview statement (s_ij, j = 1, …, m), where zero represents low confidence and one represents high confidence. The model relates y_ij to these statement responses and participant characteristics, with the reported confidences acting to weight each statement's contribution to the scene's aesthetic value. This case study is among the first to propose the incorporation of aesthetic value into conservation plans by leveraging subject-reported uncertainty. Understanding aesthetic attributes in Antarctica can be applied to other regions, especially through the implementation of similar surveys and models. The landscape of VR data assets continues to expand as more researchers become aware of the value that multi-modal features such as text, sound and haptic feedback add to methods of subject inquiry. Modern Bayesian modelling approaches allow insights to be drawn from such novel approaches to data collection.
(b) Citizen science

CS represents one of the most popular emerging data sources in scientific research. CS involves engaging members of the general population in one or more parts of the scientific process. Its applications can be found across almost all disciplines of science, especially in ecology and conservation, where scientists are harnessing its power to help solve critical challenges such as climate change and the decline in species abundance. Examples of citizen scientists' contributions include reporting sightings of species, measuring environmental variables and identifying species in images. Hundreds of CS projects can be found on popular online platforms including Zooniverse [33], eButterfly [34], eBird [35] and iNaturalist [36]. A fundamental issue often discussed surrounding CS is the quality of the data produced, which is generally error-prone and biased. For example, bias can arise in CS datasets due to (i) the unstructured nature of the data, (ii) collecting data opportunistically, with more observations from frequently visited locations [37] or at irregular frequencies across time [38], and (iii) the differing abilities of participants to perform tasks such as detecting or identifying species [39,40]. However, recent advances in statistics, machine learning and data science are helping realize its full potential and increase its trustworthiness [40-42].
Frequently, CS data are elicited via image classification, for example asking participants whether images contain a target class or species. In this section, we illustrate two modelling approaches for these types of data.
In the first approach, we consider a binary response variable y_ij representing whether the category has been correctly identified by participant i (i = 1, …, m) in image j (j = 1, …, n). The probability of obtaining a correct answer can be modelled using an item response model such as the three-parameter logistic model (3PL) [40,43],
$$ P(y_{ij} = 1) = \eta_j + \frac{1 - \eta_j}{1 + \exp\{-\alpha_j (Z_i - B_j)\}}, \qquad (3.1) $$
where each η_j ∈ (0, 1) is a pseudo-guessing parameter accounting for a participant's chance of answering correctly by guessing, Z_i is the latent ability of the ith participant, α_j > 0 is the slope parameter and B_j is the latent difficulty of the jth image. Sometimes, the correct answer for certain images is unknown. In this case, we estimate the latent labels for the images via the estimates of Z_i, by using the latter as weights in popular methods such as majority or consensus voting. Code to fit these models and exemplar datasets can be found in [44]. The second approach is for the case where we are interested in the proportion of a species at elicitation points in images. Here, we compute a statistic ŷ_ij ∈ [0, 1] giving the apparent proportion of the species among a number of elicitation points in image j classified by participant i. The true latent proportion Y_j can be estimated based on each participant's overall performance measures se_i and sp_i (which denote the sensitivity and specificity scores of participant i, respectively), via the measurement model
$$ \hat{y}_{ij} = se_i\, Y_j + (1 - sp_i)(1 - Y_j). \qquad (3.2) $$
A Beta prior is placed on the true proportion, yielding the model Y_j ∼ Beta(α_j, β_j), where α_j and β_j are the shape parameters of the beta distribution. The above model can be parametrized via a specified prior mean μ_j for each Y_j and a common precision parameter φ, via α_j = μ_j φ and β_j = (1 − μ_j)φ, which in turn implies that E(Y_j) = μ_j and Var(Y_j) = μ_j(1 − μ_j)/(1 + φ). Covariates can also be incorporated by defining a beta regression with logit(μ_j) = ξ^⊤x_j + U_j + ε_j, where the ε_j are error terms and the U_j are spatially dependent random effects. Both approaches account for spatial variation (captured in B_j or U_j for the first and second approach, respectively) using different spatial structures (e.g. conditional autoregressive (CAR) priors, covariance matrices or Gaussian random fields). See [40,42,45] for more details.
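The following sketch illustrates both building blocks: the 3PL response probability in (3.1), and a simple moment-based inversion of the measurement model (3.2). The inversion is shown only as intuition for how sensitivity and specificity de-bias the reported proportions; it is not a substitute for the full Bayesian beta model of [40,42].

```python
import numpy as np

def p_correct_3pl(z, eta, alpha, b):
    """3PL probability (3.1) that a participant with ability z correctly
    classifies an image with pseudo-guessing eta, slope alpha, difficulty b."""
    return eta + (1.0 - eta) / (1.0 + np.exp(-alpha * (z - b)))

def apparent_proportion(true_prop, sensitivity, specificity):
    """Expected apparent proportion under the measurement model (3.2):
    true positives detected at rate `sensitivity`, absences falsely
    reported at rate 1 - `specificity`."""
    return sensitivity * true_prop + (1.0 - specificity) * (1.0 - true_prop)

def correct_proportion(y_hat, sensitivity, specificity):
    """Invert (3.2) to de-bias a reported proportion; a method-of-moments
    illustration of the correction, not the full Bayesian model."""
    denom = sensitivity + specificity - 1.0
    return np.clip((y_hat - (1.0 - specificity)) / denom, 0.0, 1.0)
```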
The following case study illustrates the estimation of the latent proportion of hard corals across the Great Barrier Reef in Australia, obtained from underwater images classified by citizen scientists. Figure 2 shows 15 spatially balanced random points in one of the images used in the study. The apparent proportion of hard coral in the image was obtained from the number of points, out of 15, that participants classified as containing this category. Using equation (3.2), the (biased) estimates obtained from the citizen scientists can be corrected, producing a density similar to that of the latent unobserved proportions.
The integration of CS data with current monitoring efforts from Australian federal agencies and non-governmental organizations is a breakthrough that increases the amount of information about changes along the Great Barrier Reef, helps us learn about climate change impacts and allows management actions to be adapted accordingly. The model introduced here is at the root of a digital platform that estimates the health of the Great Barrier Reef using all available information. This study contributes to increasing trust in CS and producing reliable data for environmental conservation, while engaging the public and raising awareness about coral reefs.

Federated analyses and inference methods

(a) Overview
In many areas of study (health, business and environmental science, for example), the collective dataset of interest one wishes to use in modelling is often under the control of different data custodians, i.e. parties responsible for ensuring data are only used or released in instances deemed appropriate to governance requirements. Such requirements often stipulate that the data, and information pertaining to them, can only be shared in a manner deemed sufficiently private.
Federated learning is the process of fitting a model in the setting where data reside with multiple custodians. Approaches typically treat privacy as being of utmost importance, but computational efficiency is also important from a practical perspective. There are two broad data settings, each requiring its own style of algorithm. The first is the horizontal setting, in which multiple data custodians have the same set of variables for different entities. By contrast, in the vertical federated learning setting, data custodians possess different variables for the same entities. A seminal algorithm for the horizontal setting is federated averaging (FedAvg) [46], which involves updating the parameter values of a global model to be the weighted average of parameter values obtained by updating the same model locally (possibly many times) at each iteration. This work led to many related optimization algorithms, e.g. FedProx [47] and FedNova [48], which account for heterogeneous (non-i.i.d.) data sources, and the Bayesian nonparametric approach for learning neural networks of [49], where local model parameters are matched to a global model via the posterior distribution of a Beta-Bernoulli process [50]. To date, practical federated analyses appear restricted to the frequentist setting. Examples include the prediction of breast cancer using distributed logistic regression [51] and modelling of the survival of oral cavity cancer through a distributed proportional hazards Cox model [52]. Both of these approaches conduct parameter estimation via a Newton-Raphson algorithm [53] and result in maximum-likelihood estimates equivalent to those obtained in a standard, non-federated setting. Algorithms for the maximum-likelihood estimation of log-linear and logistic regression models in vertical federated learning settings [54-58] use ideas such as secure multiparty computation [59] and formulating the parameter estimation task as a dual problem [60]. Several overarching software infrastructures such as VANTAGE6 [61] ensure the correct and secure use of the data of each custodian within the specified algorithm, given acceptable (model- and application-specific) rules for information exchange. Despite the potentially enabling capabilities of federated methods, to our knowledge, Bayesian federated learning methods have yet to impact real-world applications. In the Bayesian inference setting, the 'learning' task becomes one of performing posterior inference, e.g. via MCMC or variational inference techniques. Bayesian federated learning approaches may involve one or many communication rounds. For example, many distributed MCMC approaches (e.g. [62-64]) combine individually fitted model posteriors, requiring only a single communication step from each local node. A recent intermediate approach [65] is to construct a surrogate likelihood of the complete dataset via an arbitrarily specified number of communication steps. After constructing the surrogate likelihood, an MCMC algorithm is run on a single device. As the number of communication steps increases, the approximation error introduced by the surrogate likelihood decreases. Figure 3 illustrates the difference between post hoc posterior amalgamation strategies and collaborative multi-round approaches.

Figure 3. Post hoc amalgamation (combining local posteriors p(θ|y^(k)) into an approximation of p(θ|y^(1), y^(2), y^(3)) via a combination rule f) versus a collaborative multi-round approach (updating θ_{t+1} = h(θ_t, g^(1), g^(2), g^(3)) from local summaries g^(k) over rounds t = 1, …, T).
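A minimal FedAvg-style sketch on a toy horizontal problem is given below; the least-squares model, learning rates and custodian sizes are all illustrative choices rather than anything prescribed by [46].

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy horizontal setting: each custodian holds (X_k, y_k) for different rows.
def make_data(n):
    X = rng.normal(size=(n, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)
    return X, y

custodians = [make_data(n) for n in (50, 120, 80)]

def local_update(theta, X, y, lr=0.05, steps=10):
    """Several local gradient steps on a squared-error loss."""
    for _ in range(steps):
        theta = theta - lr * 2 * X.T @ (X @ theta - y) / len(y)
    return theta

theta = np.zeros(3)
for t in range(100):  # communication rounds
    local_models = [local_update(theta, X, y) for X, y in custodians]
    weights = np.array([len(y) for _, y in custodians], dtype=float)
    # FedAvg step: the server replaces the global model with the
    # data-size-weighted average of the locally updated models.
    theta = np.average(local_models, axis=0, weights=weights)
```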
In certain cases, carrying out federated Bayesian inference is (at least in principle) relatively straightforward. For example, a naive MCMC algorithm would be trivial to construct for a simple model in which the log-likelihood decomposes across the K custodians' datasets as
$$ \log p(y \mid \theta) = \sum_{k=1}^{K} \log p(y^{(k)} \mid \theta). \qquad (4.1) $$
Hence, for the horizontal setting, all that is required is for the nodes to share the sum of their respective log-likelihood terms with the server. However, this approach would require a minimum of two communication steps per iteration of the Markov chain. Recent MCMC methods, similar in style to the FedAvg algorithm (which use Langevin dynamics to update the Markov chain), require only a single communication step per iteration [66,67]. Such approaches exploit gradient information that decomposes as a sum similarly to (4.1), though they eschew the usual Metropolis-Hastings correction and are hence asymptotically inexact. In some instances, a formally justified notion of privacy may be required, as opposed to the simply intuitive one given by aggregation of terms. Differential privacy (DP) (e.g. [68]) provides such guarantees, and there are variants of MCMC that ensure this, such as DP-MCMC [69], which accomplishes privacy guarantees at the cost of a slight perturbation of the stationary distribution of the chain. It is worth noting that all of the above examples are specific to the horizontal setting, with the vertical setting proving especially challenging as one does not have a beneficial decomposition like that of (4.1).
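The sketch below illustrates the naive construction just described: a Metropolis sampler in which, at each iteration, each node returns only the scalar sum of its log-likelihood terms, which the server aggregates as in (4.1). The normal model and prior are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Toy horizontal setting: each node holds part of an i.i.d. normal sample.
nodes = [rng.normal(loc=1.5, size=n) for n in (40, 60, 100)]

def node_loglik(y_k, theta):
    """Computed locally; only this scalar is sent to the server."""
    return norm.logpdf(y_k, loc=theta, scale=1.0).sum()

def log_post(theta):
    # Server aggregates the K scalar contributions (equation (4.1))
    # and adds the log-prior, here N(0, 10^2).
    return sum(node_loglik(y_k, theta) for y_k in nodes) + norm.logpdf(theta, 0, 10)

theta, chain = 0.0, []
lp = log_post(theta)
for _ in range(5000):
    prop = theta + rng.normal(scale=0.2)
    lp_prop = log_post(prop)  # one communication round-trip per iteration
    if np.log(rng.uniform()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    chain.append(theta)
```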
As the above alludes to, the development and use of Bayesian federated learning algorithms are complex for several reasons. A method is only suitable for a prescribed application if it satisfies a combination of requirements: compatibility with the desired model, acceptable computational and communication costs, privacy and accuracy. For each application, the choice of model and federated method will depend on where the priorities lie, e.g. accuracy, efficiency or privacy. In some cases, there may be no feasible algorithm (an example is given in the upcoming case study). Thus, inference approaches that improve upon some (or even all) of these aspects are important and warrant future research.
The ultimate goal of federated Bayesian analysis is to circumvent the need for data merging [70,71] in scenarios where merging is considered infeasible. However, for Bayesian federated learning to reach this point, these approaches must offer custodians and interested parties an accurate inference for complex models while maintaining a level of privacy acceptable to those data custodians. Thus, the methodological development that enables federated inference for more advanced Bayesian models efficiently and/or with additional privacy guarantees is likely to emerge as a critical area of interest in the coming years.

(b) Case study: federated learning with spatially dependent latent variables
The greatest hindrance to employing federated learning in real-world applications is the limited range of model types that current algorithms address. Commonly, applied statistical modelling involves incorporating hierarchical structures and latent variables [72]. To our knowledge, there are no federated Bayesian analysis algorithms at all for such models. To briefly illustrate the unique challenges and the need for developments that account for the nuances of different models, this case study considers spatially dependent latent variables based on neighbourhood structures. For simplicity, the focus is on the intrinsic conditional autoregressive (ICAR) prior [73], although variations such as the Besag-York-Mollié [74] and Leroux [75] models are similar in what follows (the latter is used, for example, in the Australian Cancer Atlas described in §7). The ICAR prior posits a vector of spatially dependent latent variables, denoted here as Z. Each element of Z corresponds to a latent area-level effect of a 'site', which is influenced by neighbouring sites. Writing i ∼ j to denote that sites i and j are considered neighbours, and assuming the graph arising from the neighbourhood structure is fully connected, the ICAR prior with precision hyperparameter τ has log-density
$$ \log p(Z \mid \tau) = \text{const} + \frac{n-1}{2}\log\tau - \frac{\tau}{2} \sum_{i \sim j} (Z_i - Z_j)^2, $$
where the sum is over all neighbouring pairs. The above may be problematic if the data custodians insist that the latent variables corresponding to their areas must be kept private to themselves. To see why, consider the case where there are two data custodians, with the sets C_1 and C_2 containing the indices of the sites held by the first and second custodian, respectively. Then,
$$ \sum_{i \sim j} (Z_i - Z_j)^2 = \sum_{\substack{i \sim j \\ i, j \in C_1}} (Z_i - Z_j)^2 + \sum_{\substack{i \sim j \\ i, j \in C_2}} (Z_i - Z_j)^2 + \sum_{\substack{i \sim j \\ i \in C_1,\, j \in C_2}} (Z_i - Z_j)^2, \qquad (4.2) $$
where the first and second terms involve only the sites under the first and second custodian, respectively. When computing the log-posterior density (as required, for example, in Markov chain Monte Carlo algorithms), the first two terms on the right-hand side above can each be aggregated locally and sent to the central server. However, the final term cannot, as each individual summand requires latent variables held by both custodians to be processed. This is because the latter term involves interactions across custodian boundaries.
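The decomposition in (4.2) can be made concrete with a toy neighbourhood structure; the graph and custodian assignment below are invented purely for illustration.

```python
import numpy as np

# Toy neighbourhood structure on 6 sites; custodian 1 holds sites {0, 1, 2},
# custodian 2 holds sites {3, 4, 5}. Edges are neighbouring pairs i ~ j.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (2, 3)]  # (2, 3) crosses custodians
C1, C2 = {0, 1, 2}, {3, 4, 5}

def pairwise_sum(Z, edge_list):
    return sum((Z[i] - Z[j]) ** 2 for i, j in edge_list)

def icar_penalty_terms(Z, tau):
    """The ICAR pairwise penalty split as in (4.2): two within-custodian
    terms (computable locally) and one cross-boundary term, which requires
    latent variables from both custodians."""
    within1 = [e for e in edges if set(e) <= C1]
    within2 = [e for e in edges if set(e) <= C2]
    across = [e for e in edges if e not in within1 and e not in within2]
    return (-0.5 * tau * pairwise_sum(Z, within1),
            -0.5 * tau * pairwise_sum(Z, within2),
            -0.5 * tau * pairwise_sum(Z, across))  # the problematic term

Z = np.random.default_rng(4).normal(size=6)
t1, t2, t_cross = icar_penalty_terms(Z, tau=2.0)
```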
Consequently, solutions such as (i) employing judicious reparameterization of the latent variables (possibly compatible with the one that is often also required to enforce identifiability), (ii) changing the model to add additional auxiliary variables or (iii) otherwise approximating the troublesome term, are required. An additional challenge is that even if inferences on individual latent variables are only available to their respective custodians, they may nevertheless 'leak' information across custodian boundaries to neighbouring sites due to the underlying dependency structure.
While the above certainly highlights particular challenges, the first two terms of (4.2) split nicely across custodians and hint that latent variables need not always be problematic. For such more straightforward cases, specialized accurate and efficient inference approaches that allow individual custodians to avoid ever sharing their latent variables (either directly or indirectly) or data are the subject of forthcoming work by the authors of this section, who have a longer-term goal of tackling more challenging cases such as ICAR and its relatives in different settings.

Bayesian inference for implicit models

(a) Overview
The appetite for developing more realistic data-driven models of complex systems is continually rising. Development of such complex models can improve our understanding of the underlying mechanisms driving real phenomena and produce more accurate predictions. However, the calibration of such models remains challenging, as the associated likelihood function with complex models is often too computationally cumbersome to permit timely statistical inferences. Despite this, it is often the case that simulating the model is orders of magnitude faster than evaluating the model's likelihood function. Such models with intractable likelihoods that nevertheless remain simulable are often referred to as implicit models. Such models are now prevalent across many areas of science (see e.g. various application chapters in [76]).
Currently, the most popular statistical approach amongst practitioners for performing Bayesian inference for implicit models is approximate Bayesian computation (ABC), popularized by Beaumont et al. [77]. A related method called generalized likelihood uncertainty estimation [78,79] predates ABC, and [80] explore its connections to ABC. The ABC approach approximates the true posterior as
$$ p_\epsilon(\theta \mid y) \propto p(\theta) \int \mathbb{1}\{\lVert x - y \rVert \le \epsilon\}\, p(x \mid \theta)\, \mathrm{d}x. \qquad (5.1) $$
Here, x denotes simulated data that have the same structure as y, ‖·‖ is some norm (i.e. ‖x − y‖ measures the closeness of the simulated data to the observed data), and ε stipulates what is considered 'close'. Intuitively, values of θ that are more likely to produce simulated data x close enough to y have increased (approximate) posterior density. Rather than compare y and x directly, it can be more efficient to compare them in a lower-dimensional space via a summarization function that aims to retain as much information from the full dataset as possible. For the posterior in (5.1) to equate to the exact posterior, we require that the observed and simulated datasets are matched perfectly (i.e. ε → 0) in terms of some sufficient summarization. However, in the majority of practical applications, a low-dimensional sufficient statistic does not exist and it is computationally infeasible to take ε → 0, so we must accept some level of approximation. Given the wide applicability of the approach, i.e. that only the ability to simulate the model is required to conduct inference, there has been an explosion of research in the past 10-15 years advancing ABC and related methods that lie within the more general class of so-called likelihood-free inference methods. A substantial portion of methodologically focused ABC research considers aspects including the effective choice of ‖·‖ (e.g. [81,82]), efficient sampling algorithms to explore the approximate posterior in (5.1) (e.g. [83]) and ABC's theoretical properties (e.g. [84]). Many of the developments of ABC and some related methods (e.g. Bayesian synthetic likelihood (BSL) [85,86]) prior to 2018 are discussed in [76], the first-ever monograph on ABC.
The following case study considers a popular class of sampling algorithms for ABC based on SMC. SMC-based ABC algorithms improve efficiency compared to sampling naively from the prior by gradually reducing the ABC tolerance ε, where the output produced at iteration t is used to improve the proposal distribution of θ at iteration t + 1. The output of the algorithm is N samples, or 'particles', from the ABC posterior in (5.1), with a final ε that is either pre-specified or determined adaptively by the algorithm. Each particle has attached to it a 'distance', which is the value of ‖x − y‖ for x simulated from the model based on the particle's parameter value. Here, we use the adaptive SMC-ABC algorithm in [87], itself a minor modification of the replenishment algorithm of [88]. The algorithm is summarized below.
(i) Draw N samples from the prior and, for each sample, simulate the model and compute the corresponding distance. Initialize ε as the largest distance among the set of particles.
(ii) Set the next ε as the α-quantile of the set of distances. Retain the N_α particles with distance less than or equal to ε.
(iii) Resample the retained particle set N − N_α times so that there are N particles.
(iv) Run MCMC on each of the resampled N − N_α particles with stationary distribution (5.1) at the current ε. This step helps to remove duplicate particles created in the previous resampling step. The number of MCMC iterations can be set adaptively based on the MCMC acceptance rate.
(v) Repeat steps (ii)-(iv) until a desired ε is reached or the MCMC acceptance rate in step (iv) is too small (i.e. the number of MCMC steps becomes too large for the computational budget).
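A compact sketch of this loop on a toy implicit model is given below. It simplifies the algorithm of [87,88] (a single MCMC move applied to every particle, a fixed number of rounds, no adaptive stopping), and the model, prior and summary statistic are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy implicit model: y ~ Normal(theta, 1), summarised by the sample mean.
y_obs = rng.normal(loc=2.0, size=50).mean()
simulate = lambda th: rng.normal(loc=th, size=50).mean()
prior_sample = lambda size: rng.uniform(-10, 10, size)
prior_logpdf = lambda th: np.where(np.abs(th) <= 10, -np.log(20.0), -np.inf)

N, alpha = 1000, 0.5
theta = prior_sample(N)                                   # step (i)
dist = np.abs(np.array([simulate(t) for t in theta]) - y_obs)

for _ in range(10):  # fixed number of SMC rounds, for illustration
    eps = np.quantile(dist, alpha)                        # step (ii)
    keep = dist <= eps
    theta, dist = theta[keep], dist[keep]
    idx = rng.integers(0, len(theta), N - len(theta))     # step (iii)
    theta = np.concatenate([theta, theta[idx]])
    dist = np.concatenate([dist, dist[idx]])
    sigma = theta.std() + 1e-6
    for k in range(len(theta)):                           # step (iv): one ABC-MCMC move
        prop = theta[k] + rng.normal(scale=sigma)
        d_prop = np.abs(simulate(prop) - y_obs)
        log_a = prior_logpdf(prop) - prior_logpdf(theta[k])
        if d_prop <= eps and np.log(rng.uniform()) < log_a:
            theta[k], dist[k] = prop, d_prop
```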
A key computational inefficiency of ABC and closely related methods such as BSL is that many of the model simulations yield MCMC proposals that are rejected. To obtain a suitable quality of approximation, it is not uncommon to need to continue the algorithm past the point where ε is small enough that average acceptance probabilities are 10^-2 or less. To overcome this issue, significant attention has been devoted to machine learning-based approaches to likelihood-free inference, especially in the past 5 years. These methods use model simulations (from different parameter values) as training data for building a conditional density estimator of the likelihood (e.g. [89]), likelihood ratio (e.g. [90]) or posterior density (e.g. [91]). Sequential versions of these methods concentrate simulation effort near the observed data, producing a more informed training set that yields a more accurate conditional density estimator in regions of non-negligible posterior probability. For the case study below, we compare the SMC-ABC approach with the sequential neural likelihood (SNL) method of [89], which is outlined below.
(i) Set the initial proposal distribution of parameter values as the prior, i.e. q(θ) = p(θ).
(ii) Generate a training dataset by drawing M parameter/simulated data pairs according to q(θ)p(x|θ). Fit a conditional normalizing flow (a flexible type of regression-density estimator) to the training data to estimate the conditional density of X|θ.
(iii) Run MCMC to obtain approximate posterior samples, using the learned conditional density of X|θ evaluated at the observed data y as the approximation to the likelihood. Samples from this approximate posterior can also be used to update the proposal distribution q(θ).
(iv) Repeat steps (ii) and (iii) for a desired number of rounds.
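For concreteness, the following sketch shows how this loop might look with the SBI package used in the case study below. The simulator, prior bounds and observed data are placeholders, and the interface follows the sbi documentation at the time of writing, so details may differ between versions.

```python
import torch
from sbi.inference import SNLE
from sbi.utils import BoxUniform

# Hypothetical stand-ins: `simulate_batch` wraps an implicit model (the VCBM
# in the case study) and returns summary statistics; the box prior and
# observed summaries are placeholders.
def simulate_batch(theta: torch.Tensor) -> torch.Tensor:
    return theta + 0.1 * torch.randn_like(theta)  # placeholder simulator

prior = BoxUniform(low=torch.zeros(4), high=torch.ones(4))
x_obs = torch.tensor([[0.5, 0.5, 0.5, 0.5]])

inference = SNLE(prior=prior)
proposal = prior                        # step (i): start from the prior
for _ in range(5):                      # rounds
    theta = proposal.sample((10_000,))  # step (ii): simulate training data
    x = simulate_batch(theta)
    density_estimator = inference.append_simulations(theta, x).train()
    posterior = inference.build_posterior(density_estimator)
    proposal = posterior.set_default_x(x_obs)  # step (iii): update proposal
samples = posterior.sample((1000,), x=x_obs)   # MCMC-based posterior samples
```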

(b) Case study: calibrating agent-based models of tumour growth
In this case study, we apply the likelihood-free methods SMC-ABC and SNL to calibrate a complex agent-based model (ABM) of tumour growth. We briefly compare the methods in terms of computational efficiency and their ability to fit simulated and real tumour growth data.
ABMs have been used in cancer modelling for some time now as they provide a spatial representation of the inherent cellular heterogeneity and stochasticity of tumours [92-94]. Largely, these models account for the individual cell-based behaviours of proliferation, movement and death, and aim to predict the impact of stochasticity on spatial tumour growth over time. Previous works have considered this in the context of angiogenesis [95], immune involvement [96,97] and also treatment [98]. In some cases, data have been used to calibrate or validate aspects of the models [99,100], although due to the computational cost and their intractable likelihood it is not always easy to infer parameters in an ABM using data.
For this case study, we use a previously published ABM called a Voronoi cell-based model (VCBM) [101,102]. In this model, cancer cells and healthy tissue cells are considered agents, whose centre is modelled by a point on a two-dimensional lattice, and whose boundary is defined by a Voronoi tessellation. To mimic tumour growth and spatial tissue deformation, the model captures cell movement using force-balance equations derived from Hooke's Law. In this way, cell movement is captured off-lattice and is a function of the local cell-neighbourhood pressure, determined using a Delaunay Triangulation.
Tumour growth is captured by introducing a probability P of an individual cancer cell proliferating, which is a function of the cell's distance to the boundary of the tumour: P = p_0(1 − d/d_max), where p_0 is the probability of proliferation, d is the cell's Euclidean distance to the tumour boundary (measured from the cell centre to the nearest healthy cell centre) and d_max is the maximum radial distance a cell can be from the boundary and still proliferate. In this way, the model evolves stochastically over time, with cells either proliferating or moving in a given time step. The model also uses g_age to define the time taken for a cell to be able to proliferate, and p_psc as the probability of cancer cell invasion. Hence, the model parameter to be estimated is θ = (p_0, p_psc, d_max, g_age).
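The proliferation rule translates directly into code; the clipping beyond d_max below is our reading of the constraint that cells further than d_max from the boundary cannot proliferate.

```python
import numpy as np

def proliferation_prob(d, p0, d_max):
    """P = p0 * (1 - d / d_max): proliferation probability decays linearly
    with distance d from the tumour boundary; cells deeper than d_max are
    assumed unable to proliferate (probability zero)."""
    return np.where(d <= d_max, p0 * (1.0 - d / d_max), 0.0)

# Example: probabilities for cells at increasing depths within the tumour.
print(proliferation_prob(np.array([0.0, 15.0, 31.0, 40.0]), p0=0.2, d_max=31.0))
```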
To validate the VCBM, we use published in vivo tumour growth measurements for ovarian cancer [103]. In these experiments, tumour volume was recorded by measuring the tumour width and length as perpendicular axes using calipers and then calculating the approximate tumour volume. We simulate the VCBM in two dimensions and calculate the corresponding tumour volume measurements equivalently. We also consider one simulated dataset generated with parameter value θ = (0.2, 10^-5, 31, 114). The datasets are shown as solid black lines in figure 4. The prior distribution on θ is given by p_0 ∼ Beta(1, 1), p_psc ∼ Beta(1, 10^4), d_max ∼ LogNormal(log(30), 1) and g_age ∼ LogNormal(log(160), 1), with parameters assumed independent a priori [104].
We run SMC-ABC until around 100 000 model simulations have been generated for each dataset. We use the SBI package [105] to implement SNL with five rounds of 10 000 model simulations for each dataset. To compare the performance of SMC-ABC and SNL, we compute the posterior predictive distribution for each dataset. For SNL, we choose the round that visually produces the most accurate posterior predictive distribution. We find that for SNL the performance can degrade with increasing rounds on the three ovarian cancer datasets. The results are shown in figure 4. It can be seen that SMC-ABC produces posterior predictive distributions that tightly enclose the time series of tumour volumes for the three real-world ovarian cancer datasets. It is evident that SNL produces an accurate posterior predictive distribution for the synthetic dataset, with substantially fewer model simulations than that used for SMC-ABC. This result is aligned with other synthetic examples in the literature (e.g. [106]). However, the SNL results for the real data are mixed, and for the three real datasets SMC-ABC produces more accurate posterior predictive distributions. Further, we do not necessarily see an improvement in SNL when increasing the number of rounds (i.e. the number of model simulations). We suggest a reason for the potential poor performance of SNL is that the real data are noisier than what the simulator is able to produce, and this may lead to a poorly estimated likelihood generated by SNL when evaluated at the observed data. By contrast, SMC-ABC produces results that are more robust to this misspecification, albeit at a higher computational cost in terms of model simulations. The potential poor performance of SNL under misspecification, and possible remedies to this problem, require further research (see [107] for the first approach to addressing this problem). In terms of computational cost, SMC-ABC takes approximately 3 h for each dataset. For SNL, it takes approximately 5 min to generate model simulations, approximately 20 min to train the conditional normalizing flow and approximately 3 h to generate approximate posterior samples (using the slice sampler as in [89]) for each round. The C++ and Python codes used in this study are available at [108].

Model transfer
Updating prior beliefs based on data is a core tenet of Bayesian inference. In the Bayesian context, model transfer extends Bayesian updating by incorporating information from a well-known source domain into a target domain. Consider the scenario where a target domain has insufficient data y_T to enable useful inference. Model transfer allows us to borrow information from a source domain with sufficient data y_S to improve inference. The transferability problem is then a question of when to transfer information, which information to transfer, and how to transfer it. This problem appears across several domains, with some solutions exploiting the underlying properties of the source model, while others create informative priors from the source information. Below, we discuss several different approaches to the model transfer problem. This broad topic is also known as transfer learning in the machine learning literature [109].
Naive updating, which uses all available source information, is a natural starting point for model transfer, though it can be detrimental. If the source and target distributions are dissimilar, negative transfer [110] may occur, reducing the inferential or predictive power of our posterior. Power priors [111] correct for the difference between source and target distributions by flattening the likelihood of the source distribution. This flattening is done by choosing a value φ ∈ [0, 1] and raising the source likelihood to the power φ, which gives
$$ p(\theta \mid y_S, y_T, \phi) \propto f_T(y_T \mid \theta)\, f_S(y_S \mid \theta)^{\phi}\, p(\theta), $$
where f_S(y_S|θ) and f_T(y_T|θ) are the source and target likelihood functions, respectively. Naive updating would simply use the value φ = 1. Finding an appropriate value for φ is challenging; intuitively, we would like to treat it as a latent variable and assign it an appropriate prior. Unfortunately, even when both datasets are from the same distribution, the resulting posterior marginal of φ may exhibit only slightly less variance than the chosen prior. This phenomenon is analysed in [112] with illustrative examples. Other approaches attempt to determine an appropriate value of φ by optimization. Different information criteria, from the standard deviance information criterion to more complex penalized likelihood-type criteria, have been used [113], including the marginal likelihood [114] and the pseudo-marginal likelihood [115], which are evaluated using only the target data.
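A minimal sketch of the power prior on a toy normal-mean problem is given below, comparing naive updating (φ = 1) with down-weighted (φ = 0.1) and discarded (φ = 0) source data; the datasets, prior and grid evaluation are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(6)

# Toy setting: normal mean with known unit variance; the source data are
# slightly shifted relative to the target (a mild source-target mismatch).
y_S = rng.normal(loc=0.3, size=200)   # plentiful source data
y_T = rng.normal(loc=0.0, size=15)    # scarce target data

theta = np.linspace(-1.5, 1.5, 2001)  # grid for evaluating the posterior

def power_prior_posterior(phi):
    """log p(theta | y_S, y_T, phi) up to a constant:
    log f_T(y_T | theta) + phi * log f_S(y_S | theta) + log p(theta)."""
    log_post = (norm.logpdf(y_T[:, None], theta).sum(axis=0)
                + phi * norm.logpdf(y_S[:, None], theta).sum(axis=0)
                + norm.logpdf(theta, 0, 10))
    w = np.exp(log_post - log_post.max())
    return w / np.trapz(w, theta)      # normalize on the grid

naive = power_prior_posterior(1.0)        # phi = 1: naive updating
tempered = power_prior_posterior(0.1)     # phi = 0.1: down-weighted source
target_only = power_prior_posterior(0.0)  # phi = 0: ignore the source
```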
The transfer learning literature contains a large number of methods for model transfer, as evidenced by the recent review paper [109]. Many of these methods are specific to neural networks, but some can still be applied to broader classes of statistical models. An example of such a method is described in [116], which uses an ensemble of convolutional neural networks with a majority-voting selection step that is easily generalized for use beyond neural networks. Another method, TrAdaBoost.R2 [117,118], iteratively reweights each data point in the source and target domains to improve the predictive performance of the target model. There are also several methods specific to generalized linear models. These use a variety of approaches to achieve model transfer, including: knockoff filters [120] to identify a subset of the source data to use, scaling of the source likelihood function [121,122], and regularization [123,124] to adjust the weight of the source data. Finally, transfer GPs [125-127] attempt to use information from the source kernel to improve model performance on the target domain. This is achieved by pooling the source and target datasets and producing a new joint kernel
$$ K = \begin{pmatrix} K_{SS} & \lambda K_{ST} \\ \lambda K_{TS} & K_{TT} \end{pmatrix}, $$
where K_SS and K_TT are the within-domain kernel blocks and K_ST = K_TS^⊤ contains the cross-domain covariances. Above, λ ∈ [0, 1], where λ = 0 indicates no information transfer and λ = 1 complete information transfer. For the interested reader, exemplar code is available via [128]. Current state-of-the-art Bayesian model transfer generalizes naive Bayesian updating but relies on fixed levels of transfer rather than incorporating uncertainty. It is still not clear how one should learn an optimal φ value in this paradigm, but we expect future research to address this and to use uncertainty more effectively. Moreover, given the interest in model-specific transfer learning, we believe that a Bayesian approach will be useful for developing general methods that are model agnostic.
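The joint kernel above is straightforward to assemble; the sketch below uses a squared-exponential base kernel as an illustrative choice (the cited transfer GP methods may use other kernels and also learn λ from data).

```python
import numpy as np

def rbf(A, B, length_scale=1.0):
    """Squared-exponential kernel between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def transfer_kernel(X_S, X_T, lam, length_scale=1.0):
    """Joint kernel over pooled source/target inputs: within-domain blocks
    are unchanged, cross-domain blocks are scaled by lambda in [0, 1]
    (lambda = 0: no transfer; lambda = 1: complete transfer)."""
    K_SS = rbf(X_S, X_S, length_scale)
    K_TT = rbf(X_T, X_T, length_scale)
    K_ST = lam * rbf(X_S, X_T, length_scale)
    return np.block([[K_SS, K_ST], [K_ST.T, K_TT]])

rng = np.random.default_rng(7)
K = transfer_kernel(rng.normal(size=(30, 2)), rng.normal(size=(8, 2)), lam=0.5)
```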

Purposeful products
A key advantage of Bayesian methods is their ability to assist in decision making, and here three different case studies showcase innovative tools using Bayesian approaches.

(a) CoRiCAL: COVID-19 vaccine risk-benefit calculator
During the first year of the COVID-19 pandemic in 2020, border closures, lockdowns and other favourable conditions meant that Australia was spared the high per capita case numbers and COVID-19-related deaths experienced in many other countries. When vaccines became available in February 2021 [129], the low number of COVID-19-related fatalities in Australia was coupled with uncertainty around highly publicized rare adverse events for the vaccines: thrombosis and thrombocytopenia syndrome from AstraZeneca [130] and myocarditis from Pfizer [131]. This led to high levels of vaccine hesitancy in the general public [132]. Although emerging scientific evidence was increasingly available on the risks of the vaccines and their effectiveness both against becoming infected and against becoming severely ill once infected [133,134], compiling and assessing this information from scientific journals and government reports is infeasible for the majority of the population. Collating this evidence into an easily understood format that people could use to make an informed decision on COVID-19 vaccination in the Australian context became crucial.
Bayesian networks [135] are conditional probability models commonly represented as directed acyclic graphs, with nodes and links representing variables of interest and the interactions between them. Conditional probabilities for the dependent child nodes are stored in conditional probability tables (figure 5), which determine the probability of a node being in a given state for each possible combination of parent node states. Using Bayes' theorem [135], the model calculates the probability of a given outcome for any defined scenario. Bayesian networks are widely used in a range of decision support settings, including public health [136,137], environmental conservation [138] and natural resource management [139].
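The mechanics can be illustrated with a deliberately tiny network; the two-node structure and the probabilities below are invented for illustration and bear no relation to CoRiCAL's actual parameterization.

```python
# Minimal two-node network: Vaccinated -> Severe disease.
# Illustrative numbers only, not CoRiCAL's conditional probability tables.
p_vacc = 0.8
p_severe = {True: 0.002, False: 0.02}  # P(severe | vaccinated)

# Enumerate the joint distribution over (vaccinated, severe).
joint = {(v, s): (p_vacc if v else 1 - p_vacc)
                 * (p_severe[v] if s else 1 - p_severe[v])
         for v in (True, False) for s in (True, False)}

# Bayes' theorem by enumeration: P(vaccinated | severe = True).
num = joint[(True, True)]
den = joint[(True, True)] + joint[(False, True)]
print(num / den)
```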
There are several characteristics that make Bayesian networks attractive for an evidence-based COVID-19 risk-benefit calculator. First, the conditional probability tables can be populated from different sources, such as data from government reports, results from scientific studies, or recommendations from experts and advisory committees. Second, the probabilistic output means that the model can respond to user-defined scenarios such as 'how likely is it that I will get sick' rather than just 'will I get sick'. Finally, Bayesian networks are highly interpretable models [140], as they allow exploration of the effect of different observed values (evidence) on the probability of certain outcomes. The COVID-19 risk calculator (CoRiCAL, https://corical.immunisationcoalition.org.au) was developed to help the general public, as well as the doctors advising them, weigh up the risks and benefits of receiving a COVID-19 vaccination. A Bayesian network model was constructed and parameterized based on the best available evidence from a range of sources, and can be used to determine a person's risk of developing symptomatic COVID-19, dying or suffering other adverse effects from COVID-19, or suffering adverse effects (including death) from the vaccine itself [141]. The model relied on Australian data to represent the context as accurately as possible; however, in cases where local data were lacking, international data were used [142,143]. Full model information, along with model code, is available via the link in [144]. A web-based interface (figure 6) was developed to create a user-friendly tool that considers a person's age and sex, the brand of the vaccine, how many doses they have already had, and the current level of transmission within the community, and displays their chances of an adverse event alongside common relatable risks. As the pandemic landscape changes, it remains crucial that the evidence for making informed choices on COVID-19 vaccination is kept accessible. The model is updated in light of new variants, and as new vaccines become available and recommended (e.g. booster shots).

(b) ReefCloud: a tool to monitor coral reefs worldwide
Recent projections estimate that 99% of the world's coral reefs will suffer from frequent marine heatwaves under 1.5°C of warming due to climate change [145]. Important ecological and socio-economic changes are already occurring in tropical oceans because of the decline of key corals that support thousands of species [146]. The latter impacts about one billion people whose income, food supply, coastal protection and cultural practices depend on coral reef biodiversity. Robust estimation of changes in coral communities at large spatial scales is challenging because observations are lacking, owing to the remoteness of coral reefs and the absence of monitoring programs in many countries. Also, the fine-scale variability of changes in coral cover results in disparate long-term coral trajectories at reef locations situated only a few hundred metres apart [147]. For reef managers, these challenges (among others) slow the development of strategies that aim to reduce the impacts of climate change on coral reefs.
A spatio-temporal Bayesian model was developed to estimate total coral cover across spatial scales and predict coverage values at unsampled locations. The model uses outputs from artificial intelligence algorithms trained to classify points on images [148].
For each Marine Ecoregion of the World (MEOW, [149]), a set of images j = 1, 2, …, J, each composed of k = 1, 2, …, 50 elicitation points, is used across years of monitoring. Counts y_it, for observation i sampled at location s_i and time t, are modelled using a binomial distribution (with p_it the probability that a point is classified as hard coral and n_i the total number of points), and are controlled by additional components including the fixed effects of environmental disturbances (cyclones and mass coral bleaching events), the nested sampling design (depth, transect, site, reef and monitoring program) modelled as independent and identically distributed Gaussian random effects, and spatio-temporal random effects.
The novelty in this model is the incorporation of spatio-temporal random effects composed of a first-order autoregressive process in time and a Gaussian field that is approximated using a Gaussian Markov random field (GMRF), where the covariance is determined by a Matérn kernel. We employed the GMRF representation via a stochastic partial differential equation, using the method of [150], implemented in the R package INLA [151]. The spatial domain is based on the observed data locations and a buffer with adjacent MEOWs to allow information sharing between units. Spatial predictions are estimated on a grid at 5 × 5 km resolution, and posterior distributions are used to reconstruct coral cover values at coarser spatial scales, including MEOW units and country level. Finally, estimates of coral cover are weighted by the proportion of coral reefs within a MEOW unit, following the methodology developed as part of the global coral reef monitoring network [152]. We use the default INLA priors for the different types of model parameters, as discussed in [153]. The model is as follows:
$$ y_{it} \sim \text{Binomial}(n_i, p_{it}), \qquad \text{logit}(p_{it}) = x_{it}^\top \beta + V_i + \omega(s_i, t), $$
where ω(s_i, t) is the spatio-temporal random effect, first-order autoregressive in time with coefficient φ and with Matérn spatial covariance across sites. The priors used for the autoregressive parameter φ and the independent Gaussian random effects V_i are the INLA defaults. Research efforts are focused on developing new technologies to assess the status of coral reefs in rapid and cost-effective ways through automatic image detection [148], and on learning about the impacts of multiple disturbances and management strategies [154,155].

(c) Australian Cancer Atlas
Cancer is the leading cause of disease burden in Australia, which has comprehensive cancer incidence reporting for all cancers except common skin cancers [156]. Yet because Australia's population is heavily concentrated in specific coastal areas, cancer rates are commonly reported only for large regions. Difficulties when using sparse data for smaller areas include the reliability of estimates and the risk of identifying individuals. However, detailed small-area analyses have immense power to identify and understand inequities in cancer outcomes. Using Bayesian hierarchical Poisson models incorporating Leroux priors [157] for spatial smoothing, robust and reliable cancer incidence and 5-year survival estimates were generated across Australian statistical areas level 2 (SA2; 2148 areas). These areas represent communities that interact together and, while population sizes vary, the median is around 10 000 people [158]. Innovative visualizations helped rapidly identify areas that differed from the national average. Further details on the methods and visualizations are available in [159]. Example code for the Bayesian spatial models is available in §§7.3 and 9.8.2 of [160].
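As a sketch of this model class (with illustrative notation; the exact specification used for the Atlas is detailed in [157,159]), the incidence model for area i with observed count y_i and expected count E_i takes the form
$$ y_i \sim \text{Poisson}(E_i \rho_i), \qquad \log \rho_i = \beta_0 + S_i, \qquad S \sim \mathcal{N}\!\left(0,\; \tau^{-1}\left[(1-\lambda) I + \lambda (D - W)\right]^{-1}\right), $$
where W is the spatial adjacency matrix over the SA2 areas, D is the diagonal matrix of neighbour counts, and the Leroux parameter λ ∈ [0, 1] interpolates between independent random effects (λ = 0) and the fully spatially structured ICAR prior (λ = 1).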
In September 2018, the Australian Cancer Atlas (atlas.cancer.org.au) was launched, providing the highest geographical resolution nationwide estimates available (figure 8). The website is designed to be interactive and engaging, featuring the ability to download all estimates, export PDFs of specific views, filter regions, rapidly compare different cancer types and rates for two areas, and more. There has been strong uptake and positive feedback, and in 2021 the estimates were updated and the set of cancer types expanded.
The Atlas has received prestigious spatial industry awards and is currently being replicated internationally. Australian Cancer Atlas 2.0 is underway; it will examine spatio-temporal patterns, and will further include cancer risk factors, some types of cancer treatment and selected cancer clinical/stage patterns. Underpinned by Bayesian methods, the Atlas will continue to provide the methods and visualizations necessary for accurate estimation, interpretation and decision making.

Conclusion
This paper has focused on a small number of current opportunities and challenges in the application of the Bayesian paradigm. Of course, these are not the only issues, but collectively they point to the maturity of current Bayesian practice and the promise of a fully mature Bayesian future. As a final thought, we note that many advances in applied Bayesian statistics in recent years are deeply indebted to computational and methodological advances surrounding complex hierarchically structured models. Modern applied Bayesian statistics thus finds itself at the interface with not only its traditional neighbour mathematics, but also increasingly with the field of computer science. This partnership is one of considerable further promise in the years to come.
All authors gave final approval for publication and agreed to be held accountable for the work performed therein.
Conflict of interest declaration. We declare we have no competing interests.